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Abstract 

Probabilistic principal component analysis (PPCA) seeks a low dimensional representation of a data set 
in the presence of independent spherical Gaussian noise, £ = a I. The maximum likelihood solution for 
the model is an eigenvalue problem on the sample covariance matrix. In this paper we consider the situation 
where the data variance is already partially explained by other factors, e.g. covariates of interest, or temporal 
correlations leaving some residual variance. We decompose the residual variance into its components through 
a generalized eigenvalue problem, which we call residual component analysis (RCA). We show that canonical 
covariates analysis (CCA) is a special case of our algorithm and explore a range of new algorithms that arise 
from the framework. We illustrate the ideas on a gene expression time series data set and the recovery of 
human pose from silhouette. 



1 Introduction 

Probabilistic principal component analysis (PPCA) decomposes the covariance of a data point, y, into a low 
rank term and a diagonal noise term. The underlying probabilistic model assumes that each datum is Gaussian 
distributed, 

y~A/"(0,WW T + (T 2 I), 

where we assume the data is centred such that its mean is zero and W 6 ?R. dxq imposes a reduced rank structure 
on the covariance (q < d — 1). The log likelihood of the centered data set with n data points, Y € 3?™ xd , 



logp(Y) = log AA(y,- : |0, WW T + a 2 I) 



can be maximized | Tipping and Bishop 1999 1 with the result that W = U g L g R T , where JJ q are the q principal 



eigenvectors of the sample covariance, S = n _1 Y T Y, L 9 is a diagonal matrix with elements £ i A = W A, — a\, 
where A; is the zth eigenvalue of the sample covariance, R is an arbitrary rotation matrix, and a 2 the noise 
variance. As a result the matrix W spans the principal subspace of the data and the model is known as principal 
components analysis. Underlying this model is an assumption that the data set can be represented by 



Y = XW T + E 



where X G 5R" X9 is the matrix of g-dimensional latent variables and E is a matrix of noise variables, each 
element being independently sampled from a zero mean Gaussian with variance a 2 . The marginal likelihood 
above is obtained by placing an isotropic prior independently on the elements of X, Xij ~ Af(0, 1). 
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Lawrence [Lawrence 



2005 1 showed that the PCA solution is also obtained for log likelihoods of the form 



logp(Y) = ^logAA(y :J |0,XX T + a 2 I) 

3=1 

which is recovered when we marginalize W with an isotropic prior instead of X. This is a duaQform of prob- 
abilistic PCA which could also be called probabilistic principal coordinate analysis as the maximum likelihood 
solution solves for the latent coordinates, X = U^LR 1 ", instead of the principal subspace. Here XJ' q are the 
first q principal eigenvectors of the inner product matrix YY T with L defined as before. Note in this case that 
the Gaussian density is independent across data features rather than data points. So the correlation is expressed 
between data points. The underlying model is in fact an product of independent Gaussian processes [Rasmussen 
and Williams, 2006 1 with linear covariance functions. 

Both of these scenarios involve maximizing log likelihoods of a similar structure, namely the covariance 
of the Gaussians is given by a low rank term plus a spherical term, XX T + <r 2 I (dual scenario). In this paper 
we consider an alternative form where the covariance is given by XX T + S, where £ is a general positive 
definite matrix. Our motivation is that our data has already been partly explained by the covariance matrix £ 
and we wish to study the components of the residuals. Our ideas can be applied in both the primal and dual 
representations: the form to be used depends on what information we wish to include in S. 



As a motivating example consider a linear additive model (Figure 1(a) I, 

Y = XW T + ZV T +E, (1) 

where Z is a matrix of known covariates that are assumed to have some predictive power for Y and X is a 
matrix of unknown confounders (as in standard PPCA). Also consider that Y could be a set of n patients' 
gene expression measurements (d genes), Z could be the genotype of each patient and X could be unobserved 
environmental confounders (see |Author|). We can marginalize out V with a Gaussian prior, Vij ~ A/"(0, a) as 



e | Author]) 

well as W with a prior Wij ~ j\f(0, l)F]and recover 

d 

logp(Y) = ]T log7V(y : j|0, XX T + E), 

3 = 1 

where for this example £ = aZZ T + er 2 I. 

Given £, can we solve for X? As we will show, the solution for X is given by a generalized eigenvalue 
problem. By using different forms for £ we can formulate different models. For example, for a particular 



choice of £ we recover canonical correlates analysis (CCA, see Section 2.1 1. In the next section we show how 
the low rank term can be optimized for general £. The only constraints that we place on £ are that it should be 
positive definite and invertible. 



2 Optimizing the Likelihood 

The log likelihood for the RCA model is given by 

L(X, £) = - d - In |K| - ^tr(YY T K- 1 ) - V± l n ( 27r ), (2) 

where we have defined K = XX T + £. We now take the eigendecomposition of £, 

£ = UAU T , (3) 

l As opposed to the typical primal form. Refers to the duality between the data-space (row-space) and the coordinate-space (column- 
space) of a design matrix, with data-samples as its rows. 

2 There is no loss of generalisation by using a standard Gaussian prior here, since the functional form of Y's distribution remains 
unchanged for a general Gaussian prior. 
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where U T U = I and A is a diagonal matrix. We now project the covariance onto this eigenbasis scaling with 
the eigenvalues, 

K = A-5U T KUA"5 =a4u t XX t UA"HI. (4) 

This allows us to define 

K = XX T + I, where X = A"JU T X, (5) 

and also implies the inverse 

K 1 = AiU T K _1 UA5. (6) 

Now we note from eq. |4]) that 

|K| = |K||A|, 

and from eq. (|6]l that 

tr(YY T K" 1 ) = tr(A~5U T YY T UA~5K~ 1 ), 
leading us to define Y = A~sU T Y so that we can rewrite the entire likelihood from eq. ^ as 

L(X, A, U) = - din A, - \ In |K| - ^tr(YY T K -1 ) - ^ Ih(2tt). 

i=i 

We know how to maximize this new likelihood with respect to X. Following a similar route to the maximum 
likelihood solution proof in | Tipping and Bishop 1999| , we take the gradient of the likelihood with respect to 
X 

BL 

^ = K 1 YY T K 1 X - K _1 X = 0, to give the stationary point YY T K" 1 X = X. (7) 
By singular value decomposition on X, we get 

X = VLR T , (8) 

then by substituting X in eq. and eq. ^ 

VLR T = YY T (VL 2 V T + I) _1 VLR T 
V(L 2 + 1) = YY T V, 

where we make use of the Woodbury matrix indentity. Now we see that maximisation relies on a regular 
eigenvalue problem of the form 

YY T V = VD, where 

(9) 

D^(L 2 +I). 



We now express this eigenvalue problem in terms of YY T . Substituting X = UA 2 X and by eq. (j8j), we 
get a decomposition of X with the same singular values as X 

X = UA5VLR T = TLR T , (10) 

where we have defined V = A~2U T T. Substituting for Y and V in the eigenvalue problem from eq. |9l, 
recovers the eigenvalue problem in the original dual-space 

A-3U T YY T UA" 1 U T T = A~3U T TD 

YY T S 1 T = TD 

which follows from the inverse of eq. ([3]). 
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So far, T is solved via a non-symmetric eigenvalue problem. Assuming that S is positive-definite (i.e. 
invertible), we define S = S _1 T and get 

YY T S = SSD 

which is in the desired form of a generalized eigenvalue problem. Now we can recover X, up to rotation 
(R = I), via the first q generalised eigenvectors of YY T and eq. ( 10 1 

X = TL = SSL = SS(D - I)i 

Due to the algebraic symmetry between our dual and primal formulations of the log-marginal likelihood, 
we can easily extend our derivations to the primal representation. For example, in the linear model in eq. ([T}), 
the maximum likelihood solution of W is computed through 

Y T YS = SSD, where S = «VV T + a 2 I and W = ES(D - I)*. (11) 



2.1 Equivalence to CCA 



Canonical covariates analysis is solved through a generalized eigenvalue problem | |De Bie et al. 2005 Bach 
land Jordan| [20021 . 



which can be rewritten as 
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(A + I). 



A few notes on CCA: The left-most block matrix is the sample covariance matrix of the joint (augmented) 
design matrix C = n" 1 ^, Y 2 ) T (Y!, Y 2 ) and Cu = n^YjYx, C 22 = n - 1 YjY 2 , C u - 
n _1 Y ] r Y 2 are the individual sample covariances and cross-covariance of Yi, Y 2 . The diagonal matrix of 
generalised eigenvalues, A, contains the canonical correlations. The generalised eigenvectors, made up of 
direction-pairs Si,S 2 , are the canonical-directions or coefficients in data-spaces 3^1,3^2 respectively. They 
maximise the correlation between a projection Y1S1 of features of Yi and a projection Y 2 S 2 of features of 
Y 2 , 

Tr " 12 S 2 = P, cT ' 



sl c 



such that S] 1 C u S a = S 2 ' C 22 S 2 = I, 



where P is a rectangular matrix with the canonical correlations on its diagonal. These projections are known as 
the canonical variates. 

To show the equivalence of RCA to CCA, we turn our attention to the generalised eigenvalue problem of 



RCA in eq. (Hi and consider the case where 







Y 2 T Y 2 



Then by inspection, the generalised eigenvectors S of RCA become the canonical directions and (D — I) be- 
comes the diagonal matrix of canonical correlations. JBach and Jordan 2005 1 showed that the CCA maximum 



likelihood solutions for Vx , V 2 in the graphical model of Figure 1 (b) (again, for centred y! , y 2 ) 



■AT 



Vi 



6i 

e 2 



are Yi = CnS^ A^ 2 R and V 2 = C 22 S 2 v; Ag /z R, where ei and e 2 are full noise covariance matrices, S^ ; 
and S 2 are the first q canonical directions, A q is the diagonal matrix of the first q canonical correlations and 



A 1/2 



the arbitrary rotations R = I. These maximum likelihood solutions are equivalent to W in eq. (11 1, when 



S = Cn, S = S^ 9) for V 1 and S = C 22 , S = S^ for V 2 



_ «(«) 
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(a) (b) (c) 

Figure 1 : (a) Graphical model of RCA. Z partially explains the variation in Y through the sub-space defined in V. The 



residual covariance is spanned by W up to some noise variance of e. [(b)] Graphical model of PCCA. Z is latent and shared 
between Yj and Y2. The standard linear approach to estimating Vi, V2 and Z is via CCA, which turns out to be a special 
case of RCA (cf. section \2A\ . |(c)| In a multi-view learning context, Yi, Y2 have shared (Z) and private (Xi, X2) latent 
components. We illustrate an iterative-RCA approach for inference in this type of model. 



Similarly, we get the PCA eigenvalue equation when S = I and the PPCA solution emerges as 

W = S(D-I)* =U,(A,-I)*. 

We notice a subtle difference from the PPCA formulation here. Whereas PPCA explicitly subtracts the noise 
variance from the q retained principal eigenvalues, RCA already incorporates any noise terms in S and stan- 
dardises them while projecting the total covariance onto the eigenbasis of X, see eq. Q. 

From the RCA perspective, CCA can be seen as setting S to be block diagonal, with each block containing 
the sample covariance matrix associated with the data. The residual components then represent the variance 
which isn't explained by those two sample covariances: i.e. the correlation between the two data sets. Residual 
components analysis is much general than this though, by alternative choices for X we can explore other residual 
components. To demonstrate this we now consider two case study data sets. The first is a gene expression 
experiment containing treatment and control, our objective will be to explore the differences between treatment 
and control. The second is a data set of human pose and silhouette [Agarwal and Triggs 2006 1 . Our objective 
is to predict the pose given the silhouette and we find a set of components which we can project the data on to 
achieve this. 



3 Case Study 1: Differences in Gene Expression Profiles 

A common data analysis challenge is to summarize the difference between treatment and control. To illustrate 
how RCA can help, we consider two gene expression time series of cell lines. The treatment cells are targeted 
by TP63 introduced into the nucleus by tamox ifen. The control cells are simply subject to tamoxifen alone. 
The data used for this case study come from ]Della Gatta et al. 2008 j^] The treatment group (Yi) contains 



ni = 13 time points of d = 22,690 gene expression measurements, whilst the control group (Y2) contains 
only n% = 7 time points. This complexity of data (with different numbers of time points and non-uniform 
sampling) is typical of many bio-medical data sets. The challenge is to represent the differences between the 
gene expression profiles for these two data sets. Canonical correlates analysis could be applied but this would 
represent the similarities between the data not the differences. Our approach is as follows. First we assume that 
both time series are identical, that would imply that they could be modeled (for example) by a Gaussian process 
with a temporal covariance function, 



•AA(0,K) 



3 Data is available on the Gene Expression Omnibus (GEO) database, under accession number GSE10562. 
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Figure 2: (a) An RBF covariance computed on the augmented time-input vector for the microarray experiment. The covari- 
ance is computed across the times for the control and the treatment, t = (0 : 20 : 240, 0, 20, 40, 60, 120, 1 80, 240), with 



bandwidth parameter i — 20 and noise variance a\ — 10 4 
Lawrence H2011| for details on an alternative approach based on Gaussian processes 



(b) ROC comparison against BATS, see also 



Kalaitzis and 



where the matrix of the covariance function, K, is computed as if both y x and y 2 were from the same function. 
Now, if we study the residual components, they will be forced to explain how the two time series are actually 
different. In other words we model the data through the dual paradigm with a covariance of the form 



XX 1 



K 



and solve to find the residual components X. We used a squared exponential covariance (or RBF kernel) for 
K whose elements were k(U, tj) — exp(— 0.5£~ 2 (t; — tj) 2 ). The parameters of the covariance function could 
be optimized, but for simplicity we set £ = 20 which provided a bandwidth roughly in line with the time point 
sampling intervals. We also added a small noise term along the diagonal of K which was set to 1% of the data 
variance. 

We project the profiles onto the eigenbasis of the first q generalised eigenvectors 

Y' = S^ T Y 

and obtain a score of differential expression based on the norms of their projections. The number q of 
retained principal eigenvectors is decided on the number of corresponding eigenvalues larger than one. Recall 
in PPCA (cf. page[l} that as the assumed noise variance of> increases, more eigenvalues become negative and 
less eigenvectors are retained in the solution of W. On a similar note, RCA standardises any positive-definite 
noise (cf. eq. Q), so we always have to test for eigenvalues larger than 1 . Here, the assumed noise variance 
embedded in the kernel drives the effective number of eigenvectors in the projection. 

We rank the scores and compare to a noisy ground truth list of binding targets of TP6^]from flDella Gatta 
et al. 2008 1, giving the ROC performan ce curve in Fi gure|2(b) The baseline method that we compare against is 
a Bayesian hierarchical model, B ATSj^j | Angelini et al. 2007 1. We notice that RCA outperforms BATS in terms 
the area under the ROC curve. 



4 Case Study 2: Iterative RCA for Prediction of Pose from Silhouette 

Probabilistic canonical correlates analysis explains two related data sets by assuming a full covariance block 
diagonal form and low rank off diagonal terms. Ek et q/ JEk et ah] J2008 1 introduced a model with both a shared 

4 A gene with a high number of binding sites for TP63 is a strong candidate for being one of its direct targets (i.e. associated with 
TP63 related diseases). The ranking list of direct targets is available at genom e, cshlp . org/content/ suppl/2008/05/05/gr . | 
[073601 . 107 .DCl/DellaGatta_SupTablel .xls] 

■' The software of Bayesian Analysis lor Time Series is available at |http : / /www .na.iac. cnr . it/bat s/index_f ile/| 
download . htm 
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latent space and private latent spaces for explaining data specifically associated with the two data sets. The 



graphical model is shown in Figure 1(c) Each partition of the data space, Yi and Y2 has its own associated 
latent space, Xi and X2 as well as a shared latent space, Z which corresponds to the standard shared latent 
space found in CCA. The advantage to a model of this structure is that if the variance that is particular to each 
partition of the data is low dimensional, this will be recovered. The partitions of the data are therefore modeled 

as 

Yi = XiWjT + ZVj + ci, where e x ~ Af(0, ofl) 

and 

Y 2 = X 2 Wj + ZVj + e 2 where e 2 ~ Af(0, ofl). 

Each set of latent variables can be marginalized through an isotropic Gaussian prior, z ~ Af(0, 1), leading to a 
covariance structure for the concatenated data set of the following form 

*_(W 1 Wj \ ,/V x V7 V t Vj\ 
^ W 2 WJ) + \V 2 Vj V 2 VjJ + {0 o 2 2 l 



Setting 



W 2 Wj ) + \ all 



allows Vi and V 2 to be optimized using the RCA algorithm. To optimize Wi and W 2 we note that the 
marginal covariance for Yi is Cn = WiW^ + ViV^ + ofl, so Wi can be optimized by RCA using 
S = ViV^ + ofl. A similar optimization can be done for W 2 . 

The data we consider come from Agarwal and Triggs [Agarwal and Triggs, 2006]. They produced a set 
of 3D human poses and associated silhouettes. The silhouettes are summarized by a d 2 — 100 dimensional 
vector of HoG features in matrix Y 2 € dt nxd ' 2 . There are n = 1, 927 frames. There are 21 points in each pose 
representation each containing x, y, z coordinates leading to d 2 = 63 for Yi <E 3?" xdl . The data is generated 
by the Poser computer software, therefore it is "noise free". To better reflect real world scenarios we added a 
small amount of Gaussian noise to each feature. 

One issue with this iterative RCA algorithm is that 3 latent dimensionalities need to be chosen. However, 
similar to probabilistic PC A, if the noise values, a\ and a\ are fixed, the latent dimensionality will be determined 
automatically. We therefore set the noise variances to a proportion, a, of the data variance. We used this fraction 
to control the dimensionality, varying it between and 1 . This gave us only one parameter in the model to vary. 
The algorithm converges when the log-marginal likelihood between two iterations differs no more than a small 
constant. The prediction of pose from silhouette can be computed through p(Yi | Y 2 ). The mean of this density 

Algorithm 1 Iterative RCA 

C <- Y T Y, C n <- n^YjY^ C 22 <- n~ 1 YjY 2 

Initialize a E [0,1], a\ <- £ tr(C u ), a\ <- f tr(C 22 ), (Wi, W 2 , Vi, V 2 ) <- 
repeat 

Compute Wi by CuWi = (V^V^ + ofl)WiAi 
Wi (ViVj" + of I) W[ q) (\[ q) - I) 3 

Compute W 2 by C 22 W 2 = (V 2 Vj + a\ I)W 2 A 2 
W 2 ^ (V 2 Vj+all)W^(A^-I)l 

S ^ { W 2 Wj+ollJ' C ° mpUte V by CV = SVA 

V <- SV(«)(A(«)-I)3, V x <- V 1:dl , : , V 2 <- V (dl+mdl+d2) , 
until the log-marginal likelihood converges 

is given by 

Yi = ViVjCWaWj + 0D-V2 + in, 
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Figure 3: Comparison of iterative RCA for a shared latent space with standard CCA and a linear regression model, (a) 
Iterative RCA against standard probabilistic CCA with root mean square errors for reconstruction of the pose. The figure 
shows the error for varying q (i.e. the latent dimensionality) in PCCA and varying a (i.e. the proportion of explained 
variance) in RCA. Linear regression also yields an RMS = 3.2098. |(b)| Latent dimensionalities on convergence, of Xi, X2 
and Z (cf. Figure [T(cj] >, for varying a. |(c)| Shows the silhouette and the pose predictions, alongside ground truth for frame 
#404, which was the frame in the test set with the largest error. 



where //i is the sample mean of Yi. Variances can also be computed, but aren't used in our experiments. 

Comparison of Iterative RCA with varying a, to PCCA with varying q, yields the root mean square (RMS) 
errors illustrated in Figure [3(a)] Iterative RCA outperforms standard PCCA in general with the smallest differ- 
ence in performance being at q = 18 for PCCA and a = 0.3. The RMS error of RCA is robust for a wide range 
of large a values. An interesting aspect of iterative RCA is the self-regularity that the algorithm imposes on the 
latent dimensionalities of the shared and private components, see Figure 3(b) As the noise increases with a, 
the eigenvalues decay faster from Z and X2 than from Xi. Other approaches to selecting the dimensionality 
of the latent spaces could also be followed, but the approach of explaining a proportion of the variance with the 
noise seems simple and satisfactory. 



5 Discussion 

We have introduced residual component analysis: an algorithm for describing a low dimensional representation 
of the residuals of a data set given partial explanation by a covariance matrix £. With imaginative application 
our algorithm allows for novel approaches to data analysis. We illustrated this with the characterization of the 
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difference between a treatment and control time series and an algorithm for fitting a low dimensional variant of 
CCA. Other forms of £ that could be of interest include one with a sparse inverse. Sparse inverse structures 
capture relations between variables that are not well characterized by low rank forms. As such, the combination 
of sparse inverse and low rank could be a powerful one. Finally a form which reflects class structure in the data 
would also allow the exploration of components of the data which were not related to the class structure. 
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